Skip to content

Conversation

@JeremyDahlgren
Copy link
Contributor

@JeremyDahlgren JeremyDahlgren commented Apr 29, 2025

Adds the ForkJoinPool.commonPool-worker- prefix to the Thread getName().startsWith() checks in
HdfsClientThreadLeakFilter to pick up .commonPool-worker-N threads as well. This resolves the
"There are still zombie threads that couldn't be terminated" errors in the Hdfs IT tests:

    com.carrotsearch.randomizedtesting.ThreadLeakError: 4 threads leaked from SUITE scope at org.elasticsearch.repositories.hdfs.HdfsTests: 
       1) Thread[id=53, name=ForkJoinPool.commonPool-worker-1, state=WAITING, group=TGRP-HdfsTests]
            at java.base/jdk.internal.misc.Unsafe.park(Native Method)
            at java.base/java.util.concurrent.ForkJoinPool.awaitWork(ForkJoinPool.java:2143)
            at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:2034)
            at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:187)

Closes #127290
Closes #127289
Closes #127288
Closes #127287

Changes "ForkJoinPool-" to "ForkJoinPool." in the
Thread getName().startsWith() checks in
HdfsClientThreadLeakFilter.  This resolves the
"There are still zombie threads that couldn't be terminated"
errors in the Hdfs IT tests.

Closes elastic#127290
Closes elastic#127289
Closes elastic#127288
Closes elastic#127287
@JeremyDahlgren JeremyDahlgren added >test Issues or PRs that are addressing/adding tests :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs Team:Distributed Coordination Meta label for Distributed Coordination team v9.1.0 labels Apr 29, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@ywangd
Copy link
Member

ywangd commented Apr 30, 2025

Out of curiosity, do you know why the test started failing recently? The hadoop version is the same between 9.0 and 9.1. So I wonder what else has change to cause the failures?

@nicktindall
Copy link
Contributor

Out of curiosity, do you know why the test started failing recently? The hadoop version is the same between 9.0 and 9.1. So I wonder what else has change to cause the failures?

They all seem to be openjdk21 builds, I wonder if the JDK changed the name of the ForkJoinPool threads?

@ywangd
Copy link
Member

ywangd commented Apr 30, 2025

Does not look like JDK version related to me since the tests do not fail with JDK21 and 8.18.

@nielsbauman nielsbauman removed their request for review April 30, 2025 06:10
@ldematte
Copy link
Contributor

I've looked back at the JDK source and the ForkJoinPool.commonPool-worker prefix has been like this since at least JDK 17, so I also wonder why tests started failing now -- maybe we tightened the checks? Or maybe we (hadoop) were not leaking any ForkJoinPool thread before?

Copy link
Contributor

@ldematte ldematte left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would consider using ForkJoinPool.commonPool-worker- as a prefix, to be consistent with the JDK/ with our own code (see TestContainersThreadFilter), but in any case this looks OK to me.

@nicktindall
Copy link
Contributor

nicktindall commented May 1, 2025

It looks like for some reason we've moved from using a specific ForkJoinPool instance to using the "common fork join pool"

See that when you create a ForkJoinPool with a specific config, the workerNamePrefix gets set to ForkJoinPool-{pid}-worker-..., which matches the existing pattern.

But the common fork join pool does not have a workerNamePrefix, which causes it to default to ForkJoinPool.commonPool-worker-.... This deviates from the existing pattern.

I tried to reproduce it locally but I can't seem to get leaks from any fork join pool, so it's hard to pin down where hadoop is using it and why that behaviour changed.

Hadoop seems to use both kinds of ForkJoinPool (common and specific) fairly extensively

Copy link
Contributor

@nicktindall nicktindall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to have an explanation for why this started failing recently, but being the common fork join pool those threads really could be coming from anywhere.

LGTM unless anyone thinks it's worth holding off until we have an explanation?

@JeremyDahlgren
Copy link
Contributor Author

It would be nice to have an explanation for why this started failing recently, but being the common fork join pool those threads really could be coming from anywhere.

I trimmed down HdfsTests.testSimpleWorkflow() to just creating the repository:

    public void testSimpleWorkflow() {
        Client client = client();
        assertAcked(
            client.admin()
                .cluster()
                .preparePutRepository(TEST_REQUEST_TIMEOUT, TEST_REQUEST_TIMEOUT, "test-repo")
                .setType("hdfs")
                .setSettings(
                    Settings.builder()
                        .put("uri", "hdfs:///")
                        .put("conf.fs.AbstractFileSystem.hdfs.impl", TestingFs.class.getName())
                        .put("path", "foo")
                        .put("chunk_size", randomIntBetween(100, 1000) + "k")
                        .put("compress", randomBoolean())
                )
        );

This fails with the commonPool threads leaked for JDK 22 and 23, but not for JDK 21 and 24 on my machine:

for JDK in 21 22 23 24 ; do ./gradlew ":plugins:repository-hdfs:test" --tests "org.elasticsearch.repositories.hdfs.HdfsTests.testSimpleWorkflow" -Druntime.java=$JDK ; done

Per the ForkJoinPool.commonPool() docs:

This pool is statically constructed; its run state is unaffected by attempts to `shutdown()` or `shutdownNow()`. However this pool and any ongoing processing are automatically terminated upon program `System.exit(int)`.

So it seems reasonable to filter the commonPool threads. But yes, the question remains what the difference is running between these JDKs.

@nicktindall
Copy link
Contributor

So it seems reasonable to filter the commonPool threads. But yes, the question remains what the difference is running between these JDKs.

I think it's a safe thing to suppress. The use of the common pool could be coming from the JVM libraries. You could stick a breakpoint in the lazy initialisation perhaps? that might reveal where it's coming from. But if it's inconclusive I don't know if it's worth spending a lot of time on?

@JeremyDahlgren
Copy link
Contributor Author

JeremyDahlgren commented May 2, 2025

So it seems reasonable to filter the commonPool threads. But yes, the question remains what the difference is running between these JDKs.

I think it's a safe thing to suppress. The use of the common pool could be coming from the JVM libraries. You could stick a breakpoint in the lazy initialisation perhaps? that might reveal where it's coming from. But if it's inconclusive I don't know if it's worth spending a lot of time on?

The ForkJoinPool.commonPool threads are created from down in Hadoop statistics that use parallelStream(), roughly from running in the debugger with a JDK 21:

HdfsBlobContainer.writeToPath():265-267:

  • has a try/finally that closes the FSDataOutputStream,
  • that closes the BufferedIOStatisticsOutputStream,
  • that closes the RawLocalFileSystem$LocalFSFileOutputStream,
  • that has a finally block that cals IOStatisticsAggregator.aggregate()
  • which calls IOStatisticsSnapshot.aggregate(), which does several calls to
  • IOStatisticsBinding.aggregateMaps() that calls
  • EvaluatingStatisticsMap.entrySet() has evalEntries.parallelStream().map(...).collect(...)
    —————— JDK —————
  • ReferencePipeline.collect()
  • AbstractPipeline.evaluate()
  • ReduceOps.evaluateParallel()
  • ForkJoinTask.invoke()
  • ForkJoinTask.doExec()
  • CountedCompleter.exec()
  • AbstractTAsk.compute()
  • ForkJoinTask.fork()
  • DefaultForkJoinWorkerThreadFactory.newCommonWithACC(pool)
  • InnocuousForkJoinWorkerThread
  • ForkJoinWorkerThread()
  • createWorker()

Note that ForkJoinPool.commonPool-worker- is also filtered in TestContainersThreadFilter.

Also it looks like there is a change in DefaultForkJoinWorkerThreadFactory.newThread() between 21 and 24 relating to the security manager. The timing of the test failures lines up with other test failures and updates going on in core infra with security manager and entitlements changes.

@JeremyDahlgren JeremyDahlgren merged commit 4408e38 into elastic:main May 2, 2025
17 checks passed
@JeremyDahlgren JeremyDahlgren added v8.19.0 auto-backport Automatically create backport pull requests when merged and removed auto-backport Automatically create backport pull requests when merged labels May 4, 2025
JeremyDahlgren added a commit to JeremyDahlgren/elasticsearch that referenced this pull request May 4, 2025
…#127534)

Adds the ForkJoinPool.commonPool-worker- prefix to the
Thread getName().startsWith() checks in HdfsClientThreadLeakFilter.
This resolves the
"There are still zombie threads that couldn't be terminated"
errors in the Hdfs IT tests.

Closes elastic#127290
Closes elastic#127289
Closes elastic#127288
Closes elastic#127287

(cherry picked from commit 4408e38)
@JeremyDahlgren
Copy link
Contributor Author

💚 All backports created successfully

Status Branch Result
8.19

Questions ?

Please refer to the Backport tool documentation

JeremyDahlgren added a commit to JeremyDahlgren/elasticsearch that referenced this pull request May 4, 2025
…#127534)

Adds the ForkJoinPool.commonPool-worker- prefix to the
Thread getName().startsWith() checks in HdfsClientThreadLeakFilter.
This resolves the
"There are still zombie threads that couldn't be terminated"
errors in the Hdfs IT tests.

Closes elastic#127290
Closes elastic#127289
Closes elastic#127288
Closes elastic#127287

(cherry picked from commit 4408e38)
@JeremyDahlgren
Copy link
Contributor Author

💚 All backports created successfully

Status Branch Result
9.0

Questions ?

Please refer to the Backport tool documentation

JeremyDahlgren added a commit that referenced this pull request May 4, 2025
…127683)

Adds the ForkJoinPool.commonPool-worker- prefix to the
Thread getName().startsWith() checks in HdfsClientThreadLeakFilter.
This resolves the
"There are still zombie threads that couldn't be terminated"
errors in the Hdfs IT tests.

Closes #127676
Closes #127677
Closes #127678
Closes #127679

(cherry picked from commit 4408e38)
JeremyDahlgren added a commit that referenced this pull request May 4, 2025
…127684)

Adds the ForkJoinPool.commonPool-worker- prefix to the
Thread getName().startsWith() checks in HdfsClientThreadLeakFilter.
This resolves the
"There are still zombie threads that couldn't be terminated"
errors in the Hdfs IT tests.

(cherry picked from commit 4408e38)
jfreden pushed a commit to jfreden/elasticsearch that referenced this pull request May 12, 2025
…#127534)

Adds the ForkJoinPool.commonPool-worker- prefix to the
Thread getName().startsWith() checks in HdfsClientThreadLeakFilter.
This resolves the
"There are still zombie threads that couldn't be terminated"
errors in the Hdfs IT tests.

Closes elastic#127290
Closes elastic#127289
Closes elastic#127288
Closes elastic#127287
JeremyDahlgren added a commit to JeremyDahlgren/elasticsearch that referenced this pull request May 22, 2025
…#127534)

Adds the ForkJoinPool.commonPool-worker- prefix to the
Thread getName().startsWith() checks in HdfsClientThreadLeakFilter.
This resolves the
"There are still zombie threads that couldn't be terminated"
errors in the Hdfs IT tests.

Closes elastic#127290
Closes elastic#127289
Closes elastic#127288
Closes elastic#127287

(cherry picked from commit 4408e38)
@JeremyDahlgren
Copy link
Contributor Author

💚 All backports created successfully

Status Branch Result
8.18

Questions ?

Please refer to the Backport tool documentation

JeremyDahlgren added a commit that referenced this pull request May 22, 2025
#128309)

Adds the ForkJoinPool.commonPool-worker- prefix to the
Thread getName().startsWith() checks in HdfsClientThreadLeakFilter.
This resolves the
"There are still zombie threads that couldn't be terminated"
errors in the Hdfs IT tests.

Closes #128305
Closes #128306
Closes #128307
Closes #128308

(cherry picked from commit 4408e38)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs Team:Distributed Coordination Meta label for Distributed Coordination team >test Issues or PRs that are addressing/adding tests v8.19.0 v9.0.2 v9.1.0

Projects

None yet

5 participants